knitr::opts_chunk$set(
message = FALSE,
warning = FALSE
)
A foodborne disease outbreak is defined as two or more cases of similar illnesses caused by the same agent (e.g., a toxin, virus or bacteria), which are linked to the same food source. The Centers for Disease Control and Prevention (CDC) from USA estimates that each year roughly 1 in 6 Americans (or 48 million people) get sick, 128,000 are hospitalized and 3,000 die of foodborne diseases. However, only a small percentage of those cases are related to a foodborne disease outbreak. Additionally, it is thought that the number of foodborne illnesses reported to the health department is much less than the actual number of illnesses that occur every year (http://www.vdh.virginia.gov/environmental-health/food-safety-in-virginia/foodborne-diseases-and-outbreaks/foodborne-outbreaks/).
The collection of data on foodborne outbreaks is imperative to identify the pathogen and the food vehicle involved, providing relevant information to monitor the prevalence of foodborne disease. In addition, such data contribute to evaluate trends and can provide the basis for regulatory changes and public health actions to improve food safety and reduce illness and death caused by foodborne diseases.
The dataset, found on CDC, provides data on foodborne disease outbreaks reported from 1998 to 2015.
The data fields include year, state, location where the food was prepared, reported food vehicle and contaminated ingredient, etiology (the pathogen, toxin, or chemical that caused the illnesses), status (whether the etiology was confirmed or suspected), total illnesses, hospitalizations and fatalities.
Objective: Verify the quality of the dataset (dimensions of data quality), as well as explore the dataset in order to identify the main causative agents, foodstuffs and locations implicated in the foodborne disease outbreak episodes.
This project is an initial analysis (version 1). Further analysis, including statistical evaluation and predictive modeling, requires a deep data wrangling and will be performed later.
library(dplyr)
library(tidyverse)
library(ggplot2)
library(StatMeasures)
library(pander)
library(tidytext)
library(wordcloud)
library(RColorBrewer)
library(stringr)
library(plotly)
library(formattable)
library(reshape2)
setwd("C:/Users/bdeta/Documents/R/Projects/3 - Foodborne_disease_outbreaks")
df <- as.data.frame(read_csv("outbreaks.csv"))
Firstly, a brief data quality analysis was performed in order to characterize some dataset issues.
pandoc.table(contents(df)[,-6])
##
## -----------------------------------------------------------------------------------
## variable class distinctValues missingValues perMissingValues
## ------------------- ----------- ---------------- --------------- ------------------
## Year numeric 18 0 0
##
## Month character 12 0 0
##
## State character 55 0 0
##
## Location character 162 2166 11.33
##
## Food character 3128 8963 46.88
##
## Ingredient character 382 17243 90.19
##
## Species character 202 6619 34.62
##
## Serotype/Genotype character 240 15212 79.56
##
## Status character 23 6619 34.62
##
## Illnesses numeric 302 0 0
##
## Hospitalizations numeric 62 3625 18.96
##
## Fatalities numeric 13 3601 18.83
## -----------------------------------------------------------------------------------
tb1 <- as.data.frame(contents(df)[,-6])
## Unknown Values2
uv2 <- list(length(which(df$State=="Unknown")),
length(which(df$Month=="Unknown")),
length(which(df$Year=="Unknown")),
length(which(df$Location=="Unknown")),
length(which(df$Food=="Unknown Food")),
length(which(df$Ingredient=="Unknown")),
length(which(df$Species=="Unknown")),
length(which(df$`Serotype/Genotype`=="Unknown")),
length(which(df$Status=="Unknown")),
length(which(df$Illnesses=="Unknown")),
length(which(df$Hospitalizations=="Unknown")),
length(which(df$Fatalities=="Unknown"))
)
#cbind missing and unknown
df.m.u <- cbind(tb1, t(as.data.frame(uv2)))
df.m.u2 <- df.m.u %>%
select(variable, missingValues, `t(as.data.frame(uv2))`)
colnames(df.m.u2)[3] <- "unknown"
df.m.u2 <- melt(df.m.u2)
colnames(df.m.u2)[2] <- "variable2"
# Missing Values and unknown
ggplotly(ggplot(data=df.m.u2, aes(x=reorder(variable, -value), y=value, fill=variable2)) + geom_bar(stat="identity", colour="Black", size=.2) + xlab("Variables") +
ggtitle("Missing and Unknown Values") + theme(axis.text.x = element_text(angle = 45, hjust = 1), legend.position = "left", plot.title = element_text(face="bold", size=10), legend.title = element_blank()) +
scale_fill_brewer(palette="Paired")) %>%
layout(
xaxis = list(
ticktext = list("Ingredient", "Serotype/Genotype", "Food", "Species", "Status", "Hospitalizations", "Fatalities", "Location", "Illnesses", "Month", "State", "Year"), showticklabels = TRUE, tickvals = list(1,2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12),
tickmode = "array"))
There is a total of 64,048 NA (not available) fields on the dataset. Some variables such as Ingredient, Serotype/Genotype, Food, Species and Status have high percentages of missing values, ranging from 34% to 90% of the fields.
The value reported as “unknown” is present in 1,072 fields and does not contribute to the data analysis as well.
This total of 65,120 blank fields represents 28% of the dataset and may compromise the data analysis since it’s not possible to identify the main elements (pathogen and vehicle) involved in each outbreak. However, this data quality issue is a reality in many outbreak investigations, in which a specific key variable is not identified.
This problem, linked to completeness dimension data quality, could be minimized by improving the survey/questionnaire form, so that the data collection instrument does not allow blank fields and captures why the value is missing instead (design a data entry form including options like “don’t know”, “not applicable”, “decline to answer”, “pending”, “missing”, “missed/not done”).
# Distinct Values
ggplotly(ggplot(tb1, aes(x=reorder(variable, -distinctValues), y=distinctValues, fill=distinctValues)) + geom_col(colour="black", size=.2) + xlab("Variables") + scale_fill_gradient(low="#c7eae5",high="#469c95") +
ggtitle("Distinct Values") + theme(axis.text.x = element_text(angle = 45, hjust = 1), legend.position = "none", plot.title = element_text(face="bold", size=10))) %>%
layout(
xaxis = list(
ticktext = list("Food", "Ingredient", "Illnesses", "Serotype/Genotype", "Species", "Location", "Hospitalizations", "State", "Status", "Year", "Fatalities", "Month"), showticklabels = TRUE, tickvals = list(1,2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12),
tickmode = "array"))
The Food variable, for example, has 3,128 distinct values, which may lead to difficulties in identify the number of outbreaks attributed to a particular food item. Again, this problem, linked to consistency dimension data quality, could be reduced by improving the survey/questionnaire form. The item “rice”, for example, has the following 20 specifications: rice; rice and beans; rice and peas; rice bowl, unspecified; rice crispy treat; rice dishes; rice dishes, unspecified; rice pilaf; rice, broccoli and cheese; rice, brown; rice, fried rice; rice, fried, vegetable; rice, spanish; rice, steamed; rice, sticky; rice, unspecified; rice, unspecified (source unknown); rice, white; rice, wild; rice, yellow.
df %>%
select_if(is.numeric) %>%
summary()
## Year Illnesses Hospitalizations Fatalities
## Min. :1998 Min. : 2.00 Min. : 0.000 Min. : 0.000
## 1st Qu.:2001 1st Qu.: 3.00 1st Qu.: 0.000 1st Qu.: 0.000
## Median :2005 Median : 8.00 Median : 0.000 Median : 0.000
## Mean :2006 Mean : 19.54 Mean : 0.948 Mean : 0.022
## 3rd Qu.:2010 3rd Qu.: 19.00 3rd Qu.: 1.000 3rd Qu.: 0.000
## Max. :2015 Max. :1939.00 Max. :308.000 Max. :33.000
## NA's :3625 NA's :3601
Regarding numeric variables, especially Illnesses, Hospitalizations and Fatalities, the data seem to be consistent, except for the high number of NA (not available) fields in Hospitalizations (3,625) and Fatalities (3,601). This problem, linked to completeness dimension data quality, could be reduced by improving the patient follow-up in the healthcare system.
Multiple elements in the same field are another issue identified. For example, in the column Species there is a row filled in with “Salmonella enterica; Salmonella enterica; Salmonella enterica; Salmonella enterica”.
The following table show the number of outbreaks (rows) per column filled with more than one element:
multiple.fields.1 <- df %>%
filter(str_detect(Location, ";")) %>%
summarise(Location = n())
multiple.fields.2 <- df %>%
filter(str_detect(Food, ";")) %>%
summarise(Food = n())
multiple.fields.3 <- df %>%
filter(str_detect(Species, ";")) %>%
summarise(Species = n())
multiple.fields.4 <- df %>%
filter(str_detect(Status, ";")) %>%
summarise(Status = n())
multiple.fields <- cbind(multiple.fields.1, multiple.fields.2, multiple.fields.3, multiple.fields.4)
formattable(multiple.fields, list(area(col = Location:Status) ~ color_tile("#c7e9b4", "#41b6c4")))
| Location | Food | Species | Status |
|---|---|---|---|
| 1121 | 1889 | 521 | 523 |
There were identified 1,121 outbreaks related with more than one location, 1,889 related with more than one food item, 521 related with more than one pathogen (species) and 523 outbreaks related with more than one status (confirmed or suspected). This data arrangement can make analysis difficult, mainly regarding the numeric columns such as illnesses, hospitalizations and fatalities, because the multiple elements in each field correspond to a single value in the numeric columns.
Perhaps, unfold each outbreak (with more than one element per column) into more than one row and assign a specific “id” for each outbreak could be a good way to solve this problem.
The foodborne disease events (outbreaks, illnesses, hospitalizations and fatalities) were analyzed over the years (1998-2015).
df.num <- df %>%
group_by(Year) %>%
mutate(Hospitalizations = replace_na(Hospitalizations, 0)) %>%
mutate(Fatalities = replace_na(Fatalities, 0)) %>%
summarise(outbreaks = n(), illnesses = sum(Illnesses), hospitalizations = sum(Hospitalizations), fatalities = sum(Fatalities))
formattable(df.num, list(
Year = formatter("span", style = ~ style(color = "grey",font.weight = "bold")),
outbreaks = color_bar("#b8ddf2"),
illnesses = color_bar("#6ed46e"),
hospitalizations = color_bar("#f7d383"),
fatalities = color_bar("#eb724d")))
| Year | outbreaks | illnesses | hospitalizations | fatalities |
|---|---|---|---|---|
| 1998 | 1317 | 27156 | 886 | 33 |
| 1999 | 1337 | 24899 | 598 | 10 |
| 2000 | 1405 | 26033 | 728 | 22 |
| 2001 | 1248 | 25192 | 665 | 11 |
| 2002 | 1320 | 24939 | 734 | 14 |
| 2003 | 1089 | 23079 | 687 | 24 |
| 2004 | 1328 | 29034 | 779 | 22 |
| 2005 | 959 | 19761 | 592 | 8 |
| 2006 | 1255 | 28656 | 1170 | 10 |
| 2007 | 1088 | 20970 | 877 | 18 |
| 2008 | 1029 | 23089 | 1250 | 22 |
| 2009 | 669 | 13813 | 546 | 7 |
| 2010 | 853 | 15893 | 662 | 20 |
| 2011 | 796 | 14278 | 952 | 45 |
| 2012 | 833 | 14995 | 859 | 20 |
| 2013 | 824 | 13431 | 1051 | 14 |
| 2014 | 872 | 13295 | 722 | 23 |
| 2015 | 897 | 15018 | 923 | 14 |
p1 <- ggplotly(ggplot(df.num, aes(x = Year, y = outbreaks)) +
geom_line(colour="#b8ddf2") +
geom_point(size=1, colour="#b8ddf2") +
scale_x_continuous(breaks = df.num$Year))
p2 <- ggplotly(ggplot(df.num, aes(x = Year, y = illnesses)) +
geom_line(colour="#6ed46e") +
geom_point(size=1, colour="#6ed46e") +
scale_x_continuous(breaks = df.num$Year))
p3 <- ggplotly(ggplot(df.num, aes(x = Year, y = hospitalizations)) +
geom_line(colour="#f7d383") +
geom_point(size=1, colour="#f7d383") +
scale_x_continuous(breaks = df.num$Year))
p4 <- ggplotly(ggplot(df.num, aes(x = Year, y = fatalities)) +
geom_line(colour="#eb724d") +
geom_point(size=1, colour="#eb724d") +
scale_x_continuous(breaks = df.num$Year) +
theme(axis.text.x = element_text(angle = 90, hjust = 1)))
subplot(p1, p2, p3, p4, nrows = 4, shareX = TRUE, margin = 0.02) %>% layout(annotations = list(
list(
x = .045,
y = 0.995,
font = list(size = 11),
text = "Outbreaks",
xref = "paper",
yref = "paper",
xanchor = "center",
yanchor = "bottom",
showarrow = FALSE
),
list(
x = .036,
y = 0.725,
font = list(size = 11),
text = "Illnesses",
xref = "paper",
yref = "paper",
xanchor = "center",
yanchor = "bottom",
showarrow = FALSE
),
list(
x = .07,
y = 0.475,
font = list(size = 11),
text = "Hospitalizations",
xref = "paper",
yref = "paper",
xanchor = "center",
yanchor = "bottom",
showarrow = FALSE
),
list(
x = .038,
y = 0.224,
font = list(size = 11),
text = "Fatalities",
xref = "paper",
yref = "paper",
xanchor = "center",
yanchor = "bottom",
showarrow = FALSE
)
))
With regards to the time series analysis, the number of outbreaks and illnesses showed a meaningful reduction until 2009. From 2010 to 2015, these events showed constant values, which could indicates that there were no significant increases in outbreaks occurrence, neither large outbreaks (high rates of reported foodborne illnesses) in the last 5 years analyzed. The fatalities events peaked in 2011, as well as the hospitalizations in 2006 and 2008.
The foodborne disease events (outbreaks, illnesses, hospitalizations and fatalities) were analyzed by month (January to December).
df.num2 <- df %>%
mutate(Month = factor(Month, levels = c("January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December"))) %>%
group_by(Month) %>%
mutate(Hospitalizations = replace_na(Hospitalizations, 0)) %>%
mutate(Fatalities = replace_na(Fatalities, 0)) %>%
summarise(outbreaks = n(), illnesses = sum(Illnesses), hospitalizations = sum(Hospitalizations), fatalities = sum(Fatalities))
formattable(df.num2, list(
Month = formatter("span", style = ~ style(color = "grey",font.weight = "bold")),
outbreaks = color_bar("#b8ddf2"),
illnesses = color_bar("#6ed46e"),
hospitalizations = color_bar("#f7d383"),
fatalities = color_bar("#eb724d")))
| Month | outbreaks | illnesses | hospitalizations | fatalities |
|---|---|---|---|---|
| January | 1515 | 29145 | 784 | 24 |
| February | 1448 | 29639 | 774 | 26 |
| March | 1724 | 32178 | 1251 | 20 |
| April | 1725 | 37557 | 1531 | 17 |
| May | 1898 | 36704 | 1213 | 25 |
| June | 1819 | 34632 | 1607 | 31 |
| July | 1538 | 29713 | 1964 | 69 |
| August | 1533 | 28210 | 1469 | 26 |
| September | 1247 | 24022 | 1140 | 24 |
| October | 1347 | 26632 | 1233 | 44 |
| November | 1509 | 31314 | 1021 | 19 |
| December | 1816 | 33785 | 694 | 12 |
p1 <- ggplotly(ggplot(df.num2, aes(x = as.factor(Month), y = outbreaks, group=1)) +
geom_line(colour="#b8ddf2") +
geom_point(size=1, colour="#b8ddf2"))
p2 <- ggplotly(ggplot(df.num2, aes(x = Month, y = illnesses, group=1)) +
geom_line(colour="#6ed46e") +
geom_point(size=1, colour="#6ed46e") +
scale_y_continuous(limits = c(23700, 38000)))
p3 <- ggplotly(ggplot(df.num2, aes(x = Month, y = hospitalizations, group=2)) +
geom_line(colour="#f7d383") +
geom_point(size=1, colour="#f7d383") +
scale_y_continuous(limits = c(690, 2300)))
p4 <- ggplotly(ggplot(df.num2, aes(x = Month, y = fatalities, group=2)) +
geom_line(colour="#eb724d") +
geom_point(size=1, colour="#eb724d") +
theme(axis.text.x = element_text(angle = 90, hjust = 1)))
subplot(p1, p2, p3, p4, nrows = 4, shareX = TRUE, margin = 0.02) %>% layout(annotations = list(
list(
x = .045,
y = 0.995,
font = list(size = 11),
text = "Outbreaks",
xref = "paper",
yref = "paper",
xanchor = "center",
yanchor = "bottom",
showarrow = FALSE
),
list(
x = .036,
y = 0.72,
font = list(size = 11),
text = "Illnesses",
xref = "paper",
yref = "paper",
xanchor = "center",
yanchor = "bottom",
showarrow = FALSE
),
list(
x = .07,
y = 0.475,
font = list(size = 11),
text = "Hospitalizations",
xref = "paper",
yref = "paper",
xanchor = "center",
yanchor = "bottom",
showarrow = FALSE
),
list(
x = .038,
y = 0.223,
font = list(size = 11),
text = "Fatalities",
xref = "paper",
yref = "paper",
xanchor = "center",
yanchor = "bottom",
showarrow = FALSE
)
))
It seems to be an annual variation of outbreaks, illnesses, hospitalizations and fatalities over the months. The number of outbreaks and illnesses increased between March and June, just like in the December. The number of hospitalizations and fatalities tends to be higher in June, July and August (summer season). There was also a peak of fatalities in October.
The number of illnesses episodes across the United States is shown on the map visualization:
df.maps <- df %>%
mutate(Code = state.abb[match(State, state.name)])
df.maps.ill <- df.maps %>%
group_by(Code, State) %>%
summarise(illnesses = sum(Illnesses))
g <- list(
scope = 'usa',
projection = list(type = 'albers usa')
)
plot_geo(df.maps.ill, locationmode = 'USA-states') %>%
add_trace(
z = ~illnesses, text= ~State, locations = ~Code,
color = ~illnesses, colors = 'Greens'
) %>%
colorbar(title = "Illnesses") %>%
layout(
title = 'Foodborne disease illnesses events',
geo = g,
font = list(
size = 11)
)
An interesting analysis lies on the place of exposure to contaminated foods, which can indicates what location for food preparation poses the greatest risk of foodborne events (outbreaks, illnesses, hospitalizations and fatalities). The word cloud below shows the frequency of each word/expression appears in the Location column.
# Location
df.location <- df %>%
unnest_tokens(word, Location, token = 'regex', pattern=";", collapse = FALSE)
df.location$word <- trimws(df.location$word, which = c("left"))
df.location.count <- df.location %>%
count(word, sort = TRUE)
set.seed(1234)
wordcloud(words = df.location.count$word, freq = df.location.count$n, min.freq = 1,
max.words=100, random.order=FALSE, rot.per=0.35,
colors=brewer.pal(6, "Spectral"))
df.location.1 <- df %>%
filter(str_detect(Location, "Restaurant")) %>%
summarise(Location = "Restaurant", outbreaks = n(), illnesses = sum(Illnesses), hospitalizations = sum(Hospitalizations), fatalities = sum(Fatalities))
df.location.2 <- df %>%
filter(str_detect(Location, "Private Home/Residence")) %>%
summarise(Location = "Private Home/Residence", outbreaks = n(), illnesses = sum(Illnesses), hospitalizations = sum(Hospitalizations), fatalities = sum(Fatalities))
df.location.3 <- df %>%
filter(str_detect(Location, "Catering Service")) %>%
summarise(Location = "Catering Service", outbreaks = n(), illnesses = sum(Illnesses), hospitalizations = sum(Hospitalizations), fatalities = sum(Fatalities))
df.location.4 <- df %>%
filter(str_detect(Location, "Grocery Store")) %>%
summarise(Location = "Grocery Store", outbreaks = n(), illnesses = sum(Illnesses), hospitalizations = sum(Hospitalizations), fatalities = sum(Fatalities))
df.location.5 <- df %>%
filter(str_detect(Location, "Banquet Facility")) %>%
summarise(Location = "Banquet Facility", outbreaks = n(), illnesses = sum(Illnesses), hospitalizations = sum(Hospitalizations), fatalities = sum(Fatalities))
df.location.6 <- df %>%
filter(str_detect(Location, "Fast Food Restaurant")) %>%
summarise(Location = "Fast Food Restaurant", outbreaks = n(), illnesses = sum(Illnesses), hospitalizations = sum(Hospitalizations), fatalities = sum(Fatalities))
df.location.full <- rbind(df.location.1, df.location.2, df.location.3, df.location.4, df.location.5, df.location.6)
formattable(df.location.full, list(
outbreaks = color_bar("#b8ddf2"),
illnesses = color_bar("#6ed46e"),
hospitalizations = color_bar("#f7d383"),
fatalities = color_bar("#eb724d")))
| Location | outbreaks | illnesses | hospitalizations | fatalities |
|---|---|---|---|---|
| Restaurant | 11536 | 159240 | 6121 | 78 |
| Private Home/Residence | 2314 | 39528 | 3437 | 92 |
| Catering Service | 1466 | 50145 | 711 | 9 |
| Grocery Store | 599 | 10223 | 775 | 15 |
| Banquet Facility | 476 | 18460 | 190 | 1 |
| Fast Food Restaurant | 434 | 6371 | 577 | 3 |
Restaurant, for example, was the most frequent place of exposure in the outbreaks reported. It was present in 11,536 outbreaks and was related to 159,240 illnesses, 6,121 hospitalizations and 78 fatalities events. This local seems to pose the greatest risk of foodborne events, although private home/residence was related with more fatalities events (92).
The main food items related to each foodborne disease events were also analyzed.
# Food
df.food <- df %>%
unnest_tokens(word, Food, token = 'regex', pattern=";", collapse = FALSE)
df.food$word <- trimws(df.food$word, which = c("left"))
df.food.count <- df.food %>%
filter(word != "na") %>%
count(word, sort = TRUE)
set.seed(1234)
wordcloud(words = df.food.count$word, freq = df.food.count$n, min.freq = 1,
max.words=100, random.order=FALSE, rot.per=0.35,
colors=brewer.pal(4, "Spectral"))
In the outbreaks which the food could be identified, the expressions “multiple foods”, “oysters, raw”, “salad, unspecified” and “potato salad” were the most frequent, as shown on the word cloud above.
In order to improve the analysis, the main components of all food expressions reported were extracted.
df.food.1 <- df %>%
filter(str_detect(Food, "Chicken")) %>%
summarise(Food = "Chicken", outbreaks = n(), illnesses = sum(Illnesses), hospitalizations = sum(Hospitalizations), fatalities = sum(Fatalities))
df.food.2 <- df %>%
filter(str_detect(Food, "Salad")) %>%
summarise(Food = "Salad", outbreaks = n(), illnesses = sum(Illnesses), hospitalizations = sum(Hospitalizations), fatalities = sum(Fatalities))
df.food.3 <- df %>%
filter(str_detect(Food, "Beef")) %>%
summarise(Food = "Beef", outbreaks = n(), illnesses = sum(Illnesses), hospitalizations = sum(Hospitalizations), fatalities = sum(Fatalities))
df.food.4 <- df %>%
filter(str_detect(Food, "Fish")) %>%
summarise(Food = "Fish", outbreaks = n(), illnesses = sum(Illnesses), hospitalizations = sum(Hospitalizations), fatalities = sum(Fatalities))
df.food.5 <- df %>%
filter(str_detect(Food, "Rice")) %>%
summarise(Food = "Rice", outbreaks = n(), illnesses = sum(Illnesses), hospitalizations = sum(Hospitalizations), fatalities = sum(Fatalities))
df.food.6 <- df %>%
filter(str_detect(Food, "Pork")) %>%
summarise(Food = "Pork", outbreaks = n(), illnesses = sum(Illnesses), hospitalizations = sum(Hospitalizations), fatalities = sum(Fatalities))
df.food.7 <- df %>%
filter(str_detect(Food, "Turkey")) %>%
summarise(Food = "Turkey", outbreaks = n(), illnesses = sum(Illnesses), hospitalizations = sum(Hospitalizations), fatalities = sum(Fatalities))
df.food.11 <- df %>%
filter(str_detect(Food, "Milk")) %>%
summarise(Food = "Milk", outbreaks = n(), illnesses = sum(Illnesses), hospitalizations = sum(Hospitalizations), fatalities = sum(Fatalities))
df.food.10 <- df %>%
filter(str_detect(Food, "Oyster")) %>%
summarise(Food = "Oyster", outbreaks = n(), illnesses = sum(Illnesses), hospitalizations = sum(Hospitalizations), fatalities = sum(Fatalities))
df.food.8 <- df %>%
filter(str_detect(Food, "Meat")) %>%
summarise(Food = "Meat", outbreaks = n(), illnesses = sum(Illnesses), hospitalizations = sum(Hospitalizations), fatalities = sum(Fatalities))
df.food.9 <- df %>%
filter(str_detect(Food, "Cheese")) %>%
summarise(Food = "Cheese", outbreaks = n(), illnesses = sum(Illnesses), hospitalizations = sum(Hospitalizations), fatalities = sum(Fatalities))
df.food.12 <- df %>%
filter(str_detect(Food, "Fruit")) %>%
summarise(Food = "Fruit", outbreaks = n(), illnesses = sum(Illnesses), hospitalizations = sum(Hospitalizations), fatalities = sum(Fatalities))
df.food.13 <- df %>%
filter(str_detect(Food, "Egg")) %>%
summarise(Food = "Egg", outbreaks = n(), illnesses = sum(Illnesses), hospitalizations = sum(Hospitalizations), fatalities = sum(Fatalities))
df.food.14 <- df %>%
filter(str_detect(Food, "Potato")) %>%
summarise(Food = "Potato", outbreaks = n(), illnesses = sum(Illnesses), hospitalizations = sum(Hospitalizations), fatalities = sum(Fatalities))
df.food.full <- rbind(df.food.1, df.food.2, df.food.3, df.food.4, df.food.5, df.food.6, df.food.7, df.food.8, df.food.9, df.food.10, df.food.11, df.food.12, df.food.13, df.food.14)
formattable(df.food.full, list(
outbreaks = color_bar("#b8ddf2"),
illnesses = color_bar("#6ed46e"),
hospitalizations = color_bar("#f7d383"),
fatalities = color_bar("#eb724d")))
| Food | outbreaks | illnesses | hospitalizations | fatalities |
|---|---|---|---|---|
| Chicken | 1309 | 23464 | 1058 | 11 |
| Salad | 1331 | 36968 | 763 | 18 |
| Beef | 958 | 18962 | 975 | 18 |
| Fish | 644 | 3704 | 239 | 3 |
| Rice | 406 | 4925 | 152 | 9 |
| Pork | 390 | 8710 | 575 | 5 |
| Turkey | 336 | 10680 | 364 | 21 |
| Meat | 295 | 5103 | 214 | 23 |
| Cheese | 366 | 6924 | 547 | 18 |
| Oyster | 244 | 2742 | 73 | 2 |
| Milk | 217 | 5375 | 241 | 6 |
| Fruit | 171 | 5754 | 98 | 2 |
| Egg | 252 | 6739 | 341 | 3 |
| Potato | 291 | 9836 | 232 | 7 |
Chicken appeared in 1,309 outbreak events and was related to 23,464 illnesses, 1,058 hospitalizations and 11 fatalities events, while Salad appeared in 1,331 outbreak events and was related to 36,968 illnesses, 763 hospitalizations and 18 fatalities events. Regarding to fatalities, the most frequent food involved was Meat (23 events), followed by Turkey (21 events).
Another important issue is related to the contaminant/pathogen that has been responsible for the largest number of outbreaks, illnesses, hospitalizations and deaths in the United States. In this dataset, the contaminant/pathogen is unknown or NA (not available) in 6,635 (34.7%) foodborne disease outbreaks.
Besides, it is important to point out that, when reported, the contaminant/pathogen status can be “confirmed” or “suspected”. Thus, this classification was evaluated in order to verify the degree of certainty of the outbreak causative agent identification.
#Confirmed and Suspected
df.species.c.s <- df %>%
unnest_tokens(species, Species, token = 'regex', pattern=";", collapse = FALSE) %>%
unnest_tokens(status, Status, token = 'regex', pattern=";", collapse = FALSE)
df.species.c.s$species <- trimws(df.species.c.s$species, which = c("left"))
df.species.c.s$status <- trimws(df.species.c.s$status, which = c("left"))
df.species.c.s %<>%
select(species, status) %>%
group_by(species, status) %>%
summarise(cont = n())
df.c.s.full <- df.species.c.s %>%
group_by(status) %>%
summarise(total = sum(cont)) %>%
mutate(`total_per (%)` = round((total / sum(total))*100, digits = 2))
formattable(df.c.s.full, list(
total = color_bar("#fa9fb5"),
`total_per (%)` = color_bar("#fcc5c0")))
| status | total | total_per (%) |
|---|---|---|
| confirmed | 8947 | 42.18 |
| na | 6619 | 31.21 |
| suspected | 5645 | 26.61 |
As shown below, the causative agent was not confirmed in the most of the outbreak events (na + suspected = 57.82%).
The causative agents initially identified were grouped by type to better summarize the most frequent ones which were confirmed.
#Genus
df.species.c.s$genus <- word(df.species.c.s$species, 1)
df.species.c.s.full <- df.species.c.s %>%
filter(genus != "na" & genus != "unknown") %>%
group_by(genus, status) %>%
summarise(cont = sum(cont))
df.genus <- dcast(df.species.c.s.full, status ~ genus, value.var="cont", fun.aggregate=sum)
df.genus <- t(df.genus[,-1])
colnames(df.genus) <- c("confirmed", "suspected")
df.genus <- as.data.frame(df.genus)
df.genus <- df.genus[order(-df.genus$confirmed),]
temp <- df.genus %>%
mutate(total = (confirmed + suspected))
df.genus2 <- cbind(df.genus, temp[,3])
df.genus2 <- df.genus2 %>%
select(-suspected)
colnames(df.genus2)[2] <- "total"
formattable(df.genus2, list(
confirmed = color_tile("#fff7bc", "#faec84"),
total = color_tile("#b1afc7", "#8b86c4")))
| confirmed | total | |
|---|---|---|
| norovirus | 3238 | 5513 |
| salmonella | 2632 | 3044 |
| escherichia | 540 | 626 |
| clostridium | 475 | 1300 |
| campylobacter | 383 | 498 |
| scombroid | 319 | 389 |
| staphylococcus | 253 | 755 |
| ciguatoxin | 236 | 272 |
| shigella | 163 | 205 |
| bacillus | 146 | 946 |
| vibrio | 102 | 165 |
| hepatitis | 87 | 88 |
| listeria | 60 | 63 |
| cryptosporidium | 42 | 53 |
| chemical | 37 | 156 |
| histamine | 37 | 44 |
| cyclospora | 28 | 31 |
| mycotoxins | 24 | 30 |
| giardia | 22 | 24 |
| trichinella | 18 | 19 |
| yersinia | 16 | 17 |
| paralytic | 12 | 17 |
| bacterium | 8 | 127 |
| heavy | 8 | 9 |
| rotavirus | 7 | 15 |
| sapovirus | 7 | 8 |
| virus | 7 | 101 |
| streptococcus | 6 | 10 |
| brucella | 5 | 5 |
| enterobacter | 5 | 6 |
| pesticides | 4 | 4 |
| neurotoxic | 3 | 8 |
| plant | 3 | 5 |
| puffer | 2 | 3 |
| amnesic | 1 | 1 |
| anisakis | 1 | 1 |
| astrovirus | 1 | 2 |
| cleaning | 1 | 8 |
| enterococcus | 1 | 1 |
| monosodium | 1 | 1 |
| parasite | 1 | 2 |
| adenovirus | 0 | 4 |
– Regarding the outbreaks in which the causative agent was identified, confirmed or not, the species Norovirus genogroup I, Salmonella enterica, Norovirus genogroup II and Clostridium perfringens were related to the largest number of outbreaks.
# Species
df.species <- df %>%
unnest_tokens(word, Species, token = 'regex', pattern=";", collapse = FALSE)
df.species$word <- trimws(df.species$word, which = c("left"))
df.species.count <- df.species %>%
filter(word != "na") %>%
count(word, sort = TRUE)
set.seed(1234)
wordcloud(words = df.species.count$word, freq = df.species.count$n,
max.words=50, random.order=FALSE, rot.per=0.5, min.freq = 1, scale=c(3,.5), colors=brewer.pal(4, "Spectral"))
The species Norovirus genogroup I appeared in 4,227 outbreak events and was related to 117,121 (31.4%) illnesses, 1,213 hospitalizations and 8 (2.4%) fatalities events. Listeria monocytogenes was implicated in the most case fatalities reported (121, 35.9%), followed by Salmonella enterica (86, 25.5%). The Listeria monocytogenes, Clostridium botulinum and Mycotoxins were associated with the highest case fatality rates (the proportion of deaths within a “case”).
df.species.1 <- df %>%
filter(str_detect(Species, "Norovirus genogroup I")) %>%
summarise(Species = "Norovirus genogroup I", outbreaks = n(), illnesses = sum(Illnesses), illnesses_per = round((sum(Illnesses)/sum(df$Illnesses))*100, digits=1), hospitalizations = sum(Hospitalizations), fatalities = sum(Fatalities), fatalities_per = round((sum(Fatalities)/sum(df$Fatalities))*100, digits=1), case_fatality_rate = round( (sum(Fatalities)/sum(Illnesses))*100, digits=2))
df.species.2 <- df %>%
filter(str_detect(Species, "Salmonella enterica")) %>%
summarise(Species = "Salmonella enterica", outbreaks = n(), illnesses = sum(Illnesses), illnesses_per = round((sum(Illnesses)/sum(df$Illnesses))*100, digits=1), hospitalizations = sum(Hospitalizations), fatalities = sum(Fatalities), fatalities_per = round((sum(Fatalities)/sum(df$Fatalities))*100, digits=1), case_fatality_rate = round( (sum(Fatalities)/sum(Illnesses))*100, digits=2))
df.species.3 <- df %>%
filter(str_detect(Species, "Norovirus genogroup II")) %>%
summarise(Species = "Norovirus genogroup II", outbreaks = n(), illnesses = sum(Illnesses), illnesses_per = round((sum(Illnesses)/sum(df$Illnesses))*100, digits=1), hospitalizations = sum(Hospitalizations), fatalities = sum(Fatalities), fatalities_per = round((sum(Fatalities)/sum(df$Fatalities))*100, digits=1), case_fatality_rate = round( (sum(Fatalities)/sum(Illnesses))*100, digits=2))
df.species.4 <- df %>%
filter(str_detect(Species, "Clostridium perfringens")) %>%
summarise(Species = "Clostridium perfringens", outbreaks = n(), illnesses = sum(Illnesses), illnesses_per = round((sum(Illnesses)/sum(df$Illnesses))*100, digits=1), hospitalizations = sum(Hospitalizations), fatalities = sum(Fatalities), fatalities_per = round((sum(Fatalities)/sum(df$Fatalities))*100, digits=1), case_fatality_rate = round( (sum(Fatalities)/sum(Illnesses))*100, digits=2))
df.species.5 <- df %>%
filter(str_detect(Species, "Norovirus unknown") | Species=="Norovirus") %>%
summarise(Species = "Norovirus", outbreaks = n(), illnesses = sum(Illnesses), illnesses_per = round((sum(Illnesses)/sum(df$Illnesses))*100, digits=1), hospitalizations = sum(Hospitalizations), fatalities = sum(Fatalities), fatalities_per = round((sum(Fatalities)/sum(df$Fatalities))*100, digits=1), case_fatality_rate = round( (sum(Fatalities)/sum(Illnesses))*100, digits=2))
df.species.6 <- df %>%
filter(str_detect(Species, "Staphylococcus aureus")) %>%
summarise(Species = "Staphylococcus aureus", outbreaks = n(), illnesses = sum(Illnesses), illnesses_per = round((sum(Illnesses)/sum(df$Illnesses))*100, digits=1), hospitalizations = sum(Hospitalizations), fatalities = sum(Fatalities), fatalities_per = round((sum(Fatalities)/sum(df$Fatalities))*100, digits=1), case_fatality_rate = round( (sum(Fatalities)/sum(Illnesses))*100, digits=2))
df.species.7 <- df %>%
filter(str_detect(Species, "Bacillus cereus")) %>%
summarise(Species = "Bacillus cereus", outbreaks = n(), illnesses = sum(Illnesses), illnesses_per = round((sum(Illnesses)/sum(df$Illnesses))*100, digits=1), hospitalizations = sum(Hospitalizations), fatalities = sum(Fatalities), fatalities_per = round((sum(Fatalities)/sum(df$Fatalities))*100, digits=1), case_fatality_rate = round( (sum(Fatalities)/sum(Illnesses))*100, digits=2))
df.species.8 <- df %>%
filter(str_detect(Species, "Escherichia coli, Shiga toxin-producing")) %>%
summarise(Species = "Escherichia coli, Shiga toxin-producing", outbreaks = n(), illnesses = sum(Illnesses), illnesses_per = round((sum(Illnesses)/sum(df$Illnesses))*100, digits=1), hospitalizations = sum(Hospitalizations), fatalities = sum(Fatalities), fatalities_per = round((sum(Fatalities)/sum(df$Fatalities))*100, digits=1), case_fatality_rate = round( (sum(Fatalities)/sum(Illnesses))*100, digits=2))
df.species.9 <- df %>%
filter(str_detect(Species, "Scombroid toxin")) %>%
summarise(Species = "Scombroid toxin", outbreaks = n(), illnesses = sum(Illnesses), illnesses_per = round((sum(Illnesses)/sum(df$Illnesses))*100, digits=1), hospitalizations = sum(Hospitalizations), fatalities = sum(Fatalities), fatalities_per = round((sum(Fatalities)/sum(df$Fatalities))*100, digits=1), case_fatality_rate = round( (sum(Fatalities)/sum(Illnesses))*100, digits=2))
df.species.10 <- df %>%
filter(str_detect(Species, "Campylobacter jejuni")) %>%
summarise(Species = "Campylobacter jejuni", outbreaks = n(), illnesses = sum(Illnesses), illnesses_per = round((sum(Illnesses)/sum(df$Illnesses))*100, digits=1), hospitalizations = sum(Hospitalizations), fatalities = sum(Fatalities), fatalities_per = round((sum(Fatalities)/sum(df$Fatalities))*100, digits=1), case_fatality_rate = round( (sum(Fatalities)/sum(Illnesses))*100, digits=2))
df.species.11 <- df %>%
filter(str_detect(Species, "Ciguatoxin")) %>%
summarise(Species = "Ciguatoxin", outbreaks = n(), illnesses = sum(Illnesses), illnesses_per = round((sum(Illnesses)/sum(df$Illnesses))*100, digits=1), hospitalizations = sum(Hospitalizations), fatalities = sum(Fatalities), fatalities_per = round((sum(Fatalities)/sum(df$Fatalities))*100, digits=1), case_fatality_rate = round( (sum(Fatalities)/sum(Illnesses))*100, digits=2))
df.species.12 <- df %>%
filter(str_detect(Species, "Shigella sonnei")) %>%
summarise(Species = "Shigella sonnei", outbreaks = n(), illnesses = sum(Illnesses), illnesses_per = round((sum(Illnesses)/sum(df$Illnesses))*100, digits=1), hospitalizations = sum(Hospitalizations), fatalities = sum(Fatalities), fatalities_per = round((sum(Fatalities)/sum(df$Fatalities))*100, digits=1), case_fatality_rate = round( (sum(Fatalities)/sum(Illnesses))*100, digits=2))
df.species.13 <- df %>%
filter(str_detect(Species, "Vibrio parahaemolyticus")) %>%
summarise(Species = "Vibrio parahaemolyticus", outbreaks = n(), illnesses = sum(Illnesses), illnesses_per = round((sum(Illnesses)/sum(df$Illnesses))*100, digits=1), hospitalizations = sum(Hospitalizations), fatalities = sum(Fatalities), fatalities_per = round((sum(Fatalities)/sum(df$Fatalities))*100, digits=1), case_fatality_rate = round( (sum(Fatalities)/sum(Illnesses))*100, digits=2))
df.species.14 <- df %>%
filter(str_detect(Species, "Hepatitis A")) %>%
summarise(Species = "Hepatitis A", outbreaks = n(), illnesses = sum(Illnesses), illnesses_per = round((sum(Illnesses)/sum(df$Illnesses))*100, digits=1), hospitalizations = sum(Hospitalizations), fatalities = sum(Fatalities), fatalities_per = round((sum(Fatalities)/sum(df$Fatalities))*100, digits=1), case_fatality_rate = round( (sum(Fatalities)/sum(Illnesses))*100, digits=2))
df.species.15 <- df %>%
filter(str_detect(Species, "Listeria monocytogenes")) %>%
summarise(Species = "Listeria monocytogenes", outbreaks = n(), illnesses = sum(Illnesses), illnesses_per = round((sum(Illnesses)/sum(df$Illnesses))*100, digits=1), hospitalizations = sum(Hospitalizations), fatalities = sum(Fatalities), fatalities_per = round((sum(Fatalities)/sum(df$Fatalities))*100, digits=1), case_fatality_rate = round( (sum(Fatalities)/sum(Illnesses))*100, digits=2))
df.species.16 <- df %>%
filter(str_detect(Species, "Clostridium botulinum")) %>%
summarise(Species = "Clostridium botulinum", outbreaks = n(), illnesses = sum(Illnesses), illnesses_per = round((sum(Illnesses)/sum(df$Illnesses))*100, digits=1), hospitalizations = sum(Hospitalizations), fatalities = sum(Fatalities), fatalities_per = round((sum(Fatalities)/sum(df$Fatalities))*100, digits=1), case_fatality_rate = round( (sum(Fatalities)/sum(Illnesses))*100, digits=2))
df.species.17 <- df %>%
filter(str_detect(Species, "Histamine")) %>%
summarise(Species = "Histamine", outbreaks = n(), illnesses = sum(Illnesses), illnesses_per = round((sum(Illnesses)/sum(df$Illnesses))*100, digits=1), hospitalizations = sum(Hospitalizations), fatalities = sum(Fatalities), fatalities_per = round((sum(Fatalities)/sum(df$Fatalities))*100, digits=1), case_fatality_rate = round( (sum(Fatalities)/sum(Illnesses))*100, digits=2))
df.species.18 <- df %>%
filter(str_detect(Species, "Cyclospora cayatenensis")) %>%
summarise(Species = "Cyclospora cayatenensis", outbreaks = n(), illnesses = sum(Illnesses), illnesses_per = round((sum(Illnesses)/sum(df$Illnesses))*100, digits=1), hospitalizations = sum(Hospitalizations), fatalities = sum(Fatalities), fatalities_per = round((sum(Fatalities)/sum(df$Fatalities))*100, digits=1), case_fatality_rate = round( (sum(Fatalities)/sum(Illnesses))*100, digits=2))
df.species.19 <- df %>%
filter(str_detect(Species, "Mycotoxins")) %>%
summarise(Species = "Mycotoxins", outbreaks = n(), illnesses = sum(Illnesses), illnesses_per = round((sum(Illnesses)/sum(df$Illnesses))*100, digits=1), hospitalizations = sum(Hospitalizations), fatalities = sum(Fatalities), fatalities_per = round((sum(Fatalities)/sum(df$Fatalities))*100, digits=1), case_fatality_rate = round( (sum(Fatalities)/sum(Illnesses))*100, digits=2))
df.species.20 <- df %>%
filter(str_detect(Species, "Escherichia coli, Enteropathogenic")) %>%
summarise(Species = "Escherichia coli, Enteropathogenic", outbreaks = n(), illnesses = sum(Illnesses), illnesses_per = round((sum(Illnesses)/sum(df$Illnesses))*100, digits=1), hospitalizations = sum(Hospitalizations), fatalities = sum(Fatalities), fatalities_per = round((sum(Fatalities)/sum(df$Fatalities))*100, digits=1), case_fatality_rate = round( (sum(Fatalities)/sum(Illnesses))*100, digits=2))
df.species.full <- rbind(df.species.1, df.species.2, df.species.3, df.species.4, df.species.5, df.species.6, df.species.7, df.species.8, df.species.9, df.species.10, df.species.11, df.species.12, df.species.13, df.species.14, df.species.15, df.species.16, df.species.17, df.species.18, df.species.19, df.species.20)
formattable(df.species.full, list(
outbreaks = color_bar("#b8ddf2"),
illnesses = color_bar("#6ed46e"),
illnesses_per = color_bar("#96d696"),
hospitalizations = color_bar("#f7d383"),
fatalities = color_bar("#eb724d"),
fatalities_per = color_bar("#eb9f88"),
case_fatality_rate = color_bar("#bdbdbd")))
| Species | outbreaks | illnesses | illnesses_per | hospitalizations | fatalities | fatalities_per | case_fatality_rate |
|---|---|---|---|---|---|---|---|
| Norovirus genogroup I | 4227 | 117121 | 31.4 | 1213 | 8 | 2.4 | 0.01 |
| Salmonella enterica | 2403 | 65351 | 17.5 | 7480 | 86 | 25.5 | 0.13 |
| Norovirus genogroup II | 1462 | 39667 | 10.6 | 531 | 6 | 1.8 | 0.02 |
| Clostridium perfringens | 979 | 33139 | 8.9 | 162 | 15 | 4.5 | 0.05 |
| Norovirus | 1131 | 23103 | 6.2 | 212 | 5 | 1.5 | 0.02 |
| Staphylococcus aureus | 630 | 9649 | 2.6 | 499 | 3 | 0.9 | 0.03 |
| Bacillus cereus | 613 | 7143 | 1.9 | 75 | 3 | 0.9 | 0.04 |
| Escherichia coli, Shiga toxin-producing | 507 | 9401 | 2.5 | 1969 | 35 | 10.4 | 0.37 |
| Scombroid toxin | 389 | 1459 | 0.4 | 48 | 0 | 0.0 | 0.00 |
| Campylobacter jejuni | 308 | 6997 | 1.9 | 252 | 1 | 0.3 | 0.01 |
| Ciguatoxin | 271 | 1073 | 0.3 | 131 | 1 | 0.3 | 0.09 |
| Shigella sonnei | 133 | 6260 | 1.7 | 226 | 2 | 0.6 | 0.03 |
| Vibrio parahaemolyticus | 125 | 1654 | 0.4 | 50 | 0 | 0.0 | 0.00 |
| Hepatitis A | 88 | 2402 | 0.6 | 244 | 5 | 1.5 | 0.21 |
| Listeria monocytogenes | 60 | 816 | 0.2 | 578 | 121 | 35.9 | 14.83 |
| Clostridium botulinum | 57 | 220 | 0.1 | 166 | 10 | 3.0 | 4.55 |
| Histamine | 44 | 220 | 0.1 | 13 | 0 | 0.0 | 0.00 |
| Cyclospora cayatenensis | 30 | 1682 | 0.5 | 29 | 0 | 0.0 | 0.00 |
| Mycotoxins | 30 | 170 | 0.0 | 72 | 7 | 2.1 | 4.12 |
| Escherichia coli, Enteropathogenic | 29 | 1346 | 0.4 | 18 | 0 | 0.0 | 0.00 |
The contribution of each causative agent for illnesses and fatalities events is shown below:
# The Species more related to illnesses
df.species.temp2 <- df.species.full %>%
filter(illnesses_per >= 2)
df.others2 <- data.frame("Species" = "Others", "outbreaks" = 0, "illnesses" = 0, "illnesses_per" = (100-sum(df.species.temp2$illnesses_per)), "hospitalizations" = 0, "fatalities" = 0, "fatalities_per" = (100-sum(df.species.temp2$fatalities_per)), "case_fatality_rate" = 0)
df.species.fatal2 <- rbind(df.species.temp2, df.others2)
colors2 = colorRampPalette(brewer.pal(5, "Greens"))(40)[df.species.fatal2$illnesses_per]
plot_ly(df.species.fatal2, labels = ~Species, values = ~illnesses_per, marker = list(colors = colors2, line = list(color = 'White', width = 0.7))) %>%
add_pie(hole = 0.6) %>%
layout(title = "Pathogen contributions to illnesses (%)", showlegend = T, font = list(
size = 11),
xaxis = list(showgrid = F, zeroline = F, showticklabels = F),
yaxis = list(showgrid = F, zeroline = F, showticklabels = F))
The donut plot showed that Norovirus genogroup I, Salmonella enterica and Norovirus genogroup II were related to approximately 60% of illnesses episodes.
# The Species more related to fatalities
df.species.temp <- df.species.full %>%
filter(fatalities_per >= 2)
df.others <- data.frame("Species" = "Others", "outbreaks" = 0, "illnesses" = 0, "illnesses_per" = (100-sum(df.species.temp$illnesses_per)), "hospitalizations" = 0, "fatalities" = 0, "fatalities_per" = (100-sum(df.species.temp$fatalities_per)), "case_fatality_rate" = 0)
df.species.fatal <- rbind(df.species.temp, df.others)
colors = colorRampPalette(brewer.pal(5, "OrRd"))(40)[df.species.fatal$fatalities_per]
plot_ly(df.species.fatal, labels = ~Species, values = ~fatalities_per, marker = list(colors = colors, line = list(color = 'White', width = 0.7))) %>%
add_pie(hole = 0.6) %>%
layout(title = "Pathogen contributions to fatalities (%)", showlegend = T, font = list(
size = 11),
xaxis = list(showgrid = F, zeroline = F, showticklabels = F),
yaxis = list(showgrid = F, zeroline = F, showticklabels = F))
Listeria monocytogenes and Salmonella enterica were implicated in 61.4% of all fatalities events.
–
Additionally, time series analyzes of the causative agents over the years (1998-2015) were performed. This analysis is relevant in order to verify uncommon foodborne outbreak episodes.
# Time Series and Species
df.species.t.1 <- df %>%
filter(str_detect(Species, "Listeria monocytogenes")) %>%
group_by(Year) %>%
summarise(Species = "Listeria monocytogenes", outbreaks = n(), illnesses = sum(Illnesses), hospitalizations = sum(Hospitalizations), fatalities = sum(Fatalities))
df.species.t.2 <- df %>%
filter(str_detect(Species, "Salmonella enterica")) %>%
group_by(Year) %>%
summarise(Species = "Salmonella enterica", outbreaks = n(), illnesses = sum(Illnesses), hospitalizations = sum(Hospitalizations), fatalities = sum(Fatalities))
df.species.t.3 <- df %>%
filter(str_detect(Species, "Escherichia coli, Shiga toxin-producing")) %>%
group_by(Year) %>%
summarise(Species = "Escherichia coli, Shiga toxin-producing", outbreaks = n(), illnesses = sum(Illnesses), hospitalizations = sum(Hospitalizations), fatalities = sum(Fatalities))
df.species.t.4 <- df %>%
filter(str_detect(Species, "Clostridium perfringens")) %>%
group_by(Year) %>%
summarise(Species = "Clostridium perfringens", outbreaks = n(), illnesses = sum(Illnesses), hospitalizations = sum(Hospitalizations), fatalities = sum(Fatalities))
df.species.t.full <- rbind(df.species.t.1, df.species.t.2, df.species.t.3, df.species.t.4)
ggplotly(ggplot(df.species.t.full, aes(x = Year, y = illnesses)) +
geom_line(aes(color = Species)) +
scale_color_manual(values = c("#b8ddf2", "#6ed46e", "#f7d383", "#eb724d")) +
scale_x_continuous(breaks = df.species.t.full$Year) +
ggtitle("Illnesses Time Series") +
theme(axis.text.x = element_text(angle = 90, hjust = 1), plot.title = element_text(face="bold", size = 10))) %>%
layout(legend = list(y = -0.2, orientation = 'h', xanchor = "center", x = 0.5))
m <- df.species.t.full[which.max(df.species.t.full$hospitalizations), ]
a <- list(
x = m$Year,
y = m$hospitalizations,
text = c(m$Species),
xref = "x",
yref = "y",
showarrow = TRUE,
arrowhead = 4,
arrowsize = .5,
ax = 20,
ay = -40
)
ggplotly(ggplot(df.species.t.full, aes(x = Year, y = hospitalizations)) +
geom_line(aes(color = Species)) +
scale_color_manual(values = c("#b8ddf2", "#6ed46e", "#f7d383", "#eb724d")) +
scale_x_continuous(breaks = df.species.t.full$Year) +
ggtitle("Hospitalizations Time Series") +
theme(axis.text.x = element_text(angle = 90, hjust = 1), plot.title = element_text(face="bold", size = 10))) %>%
add_markers() %>%
layout(legend = list(y = -0.2, orientation = 'h', xanchor = "center", x = 0.5), annotations = a)
m <- df.species.t.full[which.max(df.species.t.full$fatalities), ]
a <- list(
x = m$Year,
y = m$fatalities,
text = c(m$Species),
xref = "x",
yref = "y",
showarrow = TRUE,
arrowhead = 4,
arrowsize = .5,
ax = 20,
ay = -40
)
ggplotly(ggplot(df.species.t.full, aes(x = Year, y = fatalities)) +
geom_line(aes(color = Species)) +
scale_color_manual(values = c("#b8ddf2", "#6ed46e", "#f7d383", "#eb724d")) +
scale_x_continuous(breaks = df.species.t.full$Year) +
ggtitle("Fatalities Time Series") +
theme(axis.text.x = element_text(angle = 90, hjust = 1), plot.title = element_text(face="bold", size=10))) %>%
add_markers() %>%
layout(legend = list(y = -0.2, orientation = 'h', xanchor = "center", x = 0.5), annotations = a)
This analysis showed a high peak in hospitalizations due to Salmonella enterica in 2008 and a high peak in case fatalities due to Listeria monocytogenes in 2011.
Considering the causative agents related to the largest number of case fatalities (Listeria monocytogenes and Salmonella enterica), the food implicated in the outbreaks in which these pathogens were suspect or confirmed were summarized below.
In order to perform this analysis, the most frequent foods items were grouped in food categories (fish and fishery products, bakery products, milk and milk products, eggs and egg products, mixed or other foods, fruits and vegetables and meat and meat products).
# Listeria monocytogenes and Foods
df.species.f.1 <- df %>%
filter(str_detect(Species, "Listeria monocytogenes")) %>%
unnest_tokens(word, Food, token = 'regex', pattern=";", collapse = FALSE)
df.species.f.1$word <- trimws(df.species.f.1$word, which = c("left"))
df.species.f.1.count <- df.species.f.1 %>%
filter(word != "na") %>%
count(word, sort = TRUE)
# Salmonella enterica and Foods
df.species.f.2 <- df %>%
filter(str_detect(Species, "Salmonella enterica")) %>%
unnest_tokens(word, Food, token = 'regex', pattern=";", collapse = FALSE)
df.species.f.2$word <- trimws(df.species.f.2$word, which = c("left"))
df.species.f.2.count <- df.species.f.2 %>%
filter(word != "na") %>%
count(word, sort = TRUE)
#Listeria Plot
temp <- df.species.f.1 %>%
filter(word != "na")
temp$cat <-
ifelse(grepl("milk|cheese|queso|cream", temp$word), "milk and milk products",
ifelse(grepl("meat|deli|pate|chicken|hot|ham", temp$word), "meat and meat products",
ifelse(grepl("potato|peaches|lettuce|sprouts|celery|apple|cantaloupe|nectarine", temp$word), "fruits and vegetables",
ifelse(grepl("fish|tuna|sushi", temp$word), "fish and fishery products",
"Others"))))
ggplotly(
melt(temp %>%
group_by(cat) %>%
summarise(outbreaks = sum(n()), fatalities = sum(Fatalities))) %>%
ggplot(aes(x=cat, y=value, fill=variable)) + geom_bar(colour="black", size=.2, width=0.5, stat="identity", position=position_dodge()) + scale_fill_manual(values=c("#b8ddf2", "#eb724d")) +
ggtitle("Outbreaks and Fatalities due to L. monocytogenes by food categories") + theme(legend.position = "none", plot.title = element_text(face="bold", size=9)) + xlab("") + coord_flip()) %>%
layout(
yaxis = list(
ticktext = list("fish and fishery products", "fruits and vegetables", "meat and meat products", "milk and milk products"), showticklabels = TRUE, tickvals = list(1, 2, 3, 4),
tickmode = "array"))
# Table for Listeria
tabs1 <- temp %>%
group_by(cat) %>%
summarise(outbreaks = sum(n()), fatalities = sum(Fatalities)) %>%
mutate(outbreaks_per = round(outbreaks/sum(outbreaks), digits = 4)*100, fatalities_per = round(fatalities/sum(fatalities), digits = 4)*100)
colnames(tabs1)[1] <- "category"
formattable(tabs1)
| category | outbreaks | fatalities | outbreaks_per | fatalities_per |
|---|---|---|---|---|
| fish and fishery products | 2 | 3 | 4 | 2.44 |
| fruits and vegetables | 10 | 50 | 20 | 40.65 |
| meat and meat products | 14 | 41 | 28 | 33.33 |
| milk and milk products | 24 | 29 | 48 | 23.58 |
Of the 60 outbreaks in which Listeria monocytogenes was implicated, 50 had the food vehicle identified. The most frequent food category in outbreaks due to L. monocytogenes was milk and milk products (48%), then meat and meat products (28%), fruits and vegetables (20%) and fish and fishery products (4%).
However, regarding to case fatalities due to this pathogen, the fruits and vegetables category was the most frequent (40.65%).
#Salmonella Plot
temp2 <- df.species.f.2 %>%
filter(word != "na")
temp2$cat <-
ifelse(grepl("milk|cheese|queso|cream", temp2$word), "milk and milk products",
ifelse(grepl("meat|deli|pate|chicken|hot|ham|pork|turkey|carnitas|beef|bbq|carne|rib|lamb|goat|pig|steak|sausage|tripe|blood|duck|tartar|sausage", temp2$word), "meat and meat products",
ifelse(grepl("potato|peaches|lettuce|sprouts|celery|apple|cantaloupe|nectarine|tomato|chile|watermelon|cucumber|cilantro|bean|salad|guacamole|melon|mango|juice|grape|hummus|peanut|pepper|gallo|avocado|basil|blueberrie|carrots|mole|mushrooms|nuts|onion|papaya|fruit|vegetable|broccoli|chili|spice", temp2$word), "fruits and vegetables",
ifelse(grepl("fish|tuna|sushi|crab|oysters|salmon|lobster|bacalhau|seafood|shrimp|seabass", temp2$word), "fish and fishery products",
ifelse(grepl("egg|hollandaise|coleslaw|mayonnaise|omelette", temp2$word), "eggs and egg products",
ifelse(grepl("cake|toast|pie|pasta|lasagna|spaghetti|pudding|bread|canoli|bruschetta|biscuit|tamale|bagel|cannelloni|cookie|pastry", temp2$word), "bakery products",
"mixed or other foods"))))))
ggplotly(
melt(temp2 %>%
group_by(cat) %>%
summarise(outbreaks = sum(n()), fatalities = sum(Fatalities))) %>%
ggplot(aes(x=reorder(cat, value), y=value, fill=variable)) + geom_bar(colour="black", size=.2, width=0.5, stat="identity", position=position_dodge()) + scale_fill_manual(values=c("#b8ddf2", "#eb724d")) +
ggtitle("Outbreaks and Fatalities due to Salmonella enterica by food categories") + theme(legend.position = "none", plot.title = element_text(face="bold", size=9)) + xlab("") + coord_flip()) %>%
layout(
yaxis = list(
ticktext = list("fish and fishery products", "bakery products", "milk and milk products", "eggs and egg products", "mixed or other foods", "fruits and vegetables", "meat and meat products"), showticklabels = TRUE, tickvals = list(1, 2, 3, 4, 5, 6, 7),
tickmode = "array"))
# Table for Salmonella
tabs2 <- temp2 %>%
group_by(cat) %>%
summarise(outbreaks = sum(n()), fatalities = sum(Fatalities)) %>%
mutate(outbreaks_per = round(outbreaks/sum(outbreaks), digits = 4)*100, fatalities_per = round(fatalities/sum(fatalities), digits = 4)*100)
colnames(tabs2)[1] <- "category"
formattable(tabs2)
| category | outbreaks | fatalities | outbreaks_per | fatalities_per |
|---|---|---|---|---|
| bakery products | 92 | 3 | 5.38 | 3.80 |
| eggs and egg products | 158 | 3 | 9.23 | 3.80 |
| fish and fishery products | 79 | 2 | 4.62 | 2.53 |
| fruits and vegetables | 387 | 52 | 22.62 | 65.82 |
| meat and meat products | 655 | 13 | 38.28 | 16.46 |
| milk and milk products | 119 | 2 | 6.95 | 2.53 |
| mixed or other foods | 221 | 4 | 12.92 | 5.06 |
A total of 1400 outbreaks related to Salmonella enterica had the food vehicle identified. For this pathogen, the most frequent food category implicated in the outbreaks was meat and meat products (38.28%), followed by fruits and vegetables (22.62%).
However, regarding to case fatalities due to this pathogen, the fruits and vegetables category was the most frequent (65.82%).
As discussed in the item 4.4, Restaurant and Private home/Residence were the most frequent places of exposure to contaminated foods reported in the outbreaks of this dataset.
Therefore, it was investigated which causative agent most occurred in this two places.
df.spec.local <- df %>%
unnest_tokens(species, Species, token = 'regex', pattern=";", collapse = FALSE) %>%
unnest_tokens(location, Location, token = 'regex', pattern=";", collapse = FALSE)
df.spec.local$species <- trimws(df.spec.local$species, which = c("left"))
df.spec.local$location <- trimws(df.spec.local$location, which = c("left"))
df.spec.local.1 <- df.spec.local %>%
filter(location == "restaurant" | location == "private home/residence") %>%
filter(species != "na") %>%
group_by(location, species) %>%
summarise(occurence = n())
df.spec.local.1.top <- df.spec.local.1 %>%
filter(location == "restaurant" & occurence > 450)
df.spec.local.1.top <- df.spec.local.1.top[order(-df.spec.local.1.top$occurence),]
formattable(df.spec.local.1.top, list(occurence = color_bar("#b8ddf2")))
| location | species | occurence |
|---|---|---|
| restaurant | norovirus genogroup i | 1686 |
| restaurant | salmonella enterica | 1074 |
| restaurant | norovirus genogroup ii | 799 |
| restaurant | clostridium perfringens | 527 |
| restaurant | norovirus unknown | 479 |
| restaurant | bacillus cereus | 454 |
df.spec.local.2.top <- df.spec.local.1 %>%
filter(location == "private home/residence" & occurence > 100)
df.spec.local.2.top <- df.spec.local.2.top[order(-df.spec.local.2.top$occurence),]
formattable(df.spec.local.2.top, list(occurence = color_bar("#b8ddf2")))
| location | species | occurence |
|---|---|---|
| private home/residence | salmonella enterica | 559 |
| private home/residence | norovirus genogroup i | 279 |
| private home/residence | ciguatoxin | 200 |
| private home/residence | escherichia coli, shiga toxin-producing | 138 |
| private home/residence | norovirus genogroup ii | 128 |
| private home/residence | clostridium perfringens | 120 |
In restaurants, the most frequent causative agent was Norovirus genogroup I, followed by Salmonella enterica. The same agents, but in opposite position, was the most frequent in the outbreaks that took place in Private home/Residence.
There are 65,120 fields fulfilled with NA (not available) or “unknown” values, totaling 28% fields of this dataset. This lack of information is common in foodborne disease outbreaks data and constitutes an important limitation for its analysis.
The largest number of fatalities episodes occurred in 2011. This can be explained by the 2011 United States widespread listeriosis outbreak. The outbreak of Listeria monocytogenes food poisoning across 28 U.S. states resulted from contaminated cantaloupes linked to Jensen Farms of Holly, Colorado (https://www.cdc.gov/mmwr/preview/mmwrhtml/mm6039a5.htm?s_cid=mm6039a5_w).
The largest number of hospitalizations episodes occurred in 2008. This episode is linked with the 2008 United States salmonellosis outbreak across multiple U.S. states. The CDC investigation determined that jalapeño peppers imported from Mexico as well as Serrano peppers were the major sources of the outbreak. Tomatoes may have been a source as well (https://www.cdc.gov/salmonella/2008/raw-produce-8-28-2008.html).
The time series analysis showed that the highest number of hospitalizations and fatalities events occurred in July.
Restaurant was the most frequent reported place of exposure to contaminated foods, followed by Private Home/Residence.
Chicken was the most common food item identified in the foodborne disease outbreaks and was related to the highest number of hospitalizations.
Fruits and vegetables was the most implicated category in case fatalities.
Norovirus genogroup I was the most common agent implicated in outbreaks and illnesses. Salmonella enterica was the second one.
Listeria monocytogenes was the most common agent related to deaths, followed by Salmonella enterica.
In Listeria monocytogenes outbreaks, the most frequent food category reported was milk and milk products.
In Salmonella enterica outbreaks, the most frequent food category reported was meat and meat products.
I hope you enjoy this project…
If you have any question or suggestion, I will be appreciate to receive them…